NBA Win Predictions by C. Andrew Byrd IV

Introduction

I have recently been reading “Mathletics” by Wayne Winston (http://waynewinston.com/wordpress/?page_id=13). It’s a neat book in which this sports enthusiast and math professor walks the reader through how to calculate almost any statistic for baseball, basketball, and football. I found his chapters on basketball and the NBA very fun to read, and I wanted to test his ideas on predicting wins for NBA teams using only information from the box score. He asserts that you can create a wins model from about 8 variables, covering everything from the team’s shooting effectiveness to the opponent’s effectiveness in getting to the free throw line.

To create the model, Winston compares a team’s abilities against those of its opponents (e.g. the team’s EFG% vs the opponent’s EFG%). This is done for EFG, TTP, RebRate, and FTR by simply taking the difference between the team’s statistic and its opponents’. I’ve created these fields below to be studied within this project.

I’ve outlined all details of the data in the “NBA Data Info” file.

Variable Analysis

## [1] 1014   18
##  [1] "X"            "Team"         "year"         "W.L."        
##  [5] "Conference"   "MadePlayoffs" "EFG"          "Opp_EFG"     
##  [9] "TTP"          "DTTP"         "ORebRate"     "DRebRate"    
## [13] "FTR"          "OFTR"         "EFG_diff"     "TTP_diff"    
## [17] "RebRate_diff" "FTR_diff"
## 'data.frame':    1014 obs. of  18 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Team        : Factor w/ 39 levels "Atlanta Hawks",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : int  1981 1985 1990 1992 2000 2001 2002 2003 2004 2005 ...
##  $ W.L.        : num  0.378 0.415 0.5 0.463 0.341 0.305 0.402 0.427 0.341 0.159 ...
##  $ Conference  : Factor w/ 2 levels "East","West": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MadePlayoffs: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EFG         : num  0.48 0.489 0.496 0.481 0.46 ...
##  $ Opp_EFG     : num  0.498 0.485 0.509 0.497 0.481 ...
##  $ TTP         : num  0.266 0.247 0.219 0.222 0.246 ...
##  $ DTTP        : num  0.27 0.246 0.226 0.221 0.2 ...
##  $ ORebRate    : num  0.336 0.316 0.353 0.323 0.301 ...
##  $ DRebRate    : num  0.642 0.623 0.625 0.654 0.668 ...
##  $ FTR         : num  0.293 0.25 0.277 0.203 0.217 ...
##  $ OFTR        : num  0.295 0.249 0.254 0.209 0.196 ...
##  $ EFG_diff    : num  -0.01755 0.00356 -0.01314 -0.0158 -0.02156 ...
##  $ TTP_diff    : num  -0.003762 0.001072 -0.006916 0.000317 0.04618 ...
##  $ RebRate_diff: num  -0.306 -0.307 -0.272 -0.331 -0.367 ...
##  $ FTR_diff    : num  -0.0017 0.00152 0.02303 -0.00643 0.02137 ...
##  [1] "Atlanta Hawks"                    
##  [2] "Boston Celtics"                   
##  [3] "Brooklyn Nets"                    
##  [4] "Charlotte Bobcats"                
##  [5] "Charlotte Hornets"                
##  [6] "Chicago Bulls"                    
##  [7] "Cleveland Cavaliers"              
##  [8] "Dallas Mavericks"                 
##  [9] "Denver Nuggets"                   
## [10] "Detroit Pistons"                  
## [11] "Golden State Warriors"            
## [12] "Houston Rockets"                  
## [13] "Indiana Pacers"                   
## [14] "Kansas City Kings"                
## [15] "Los Angeles Clippers"             
## [16] "Los Angeles Lakers"               
## [17] "Memphis Grizzlies"                
## [18] "Miami Heat"                       
## [19] "Milwaukee Bucks"                  
## [20] "Minnesota Timberwolves"           
## [21] "New Jersey Nets"                  
## [22] "New Orleans Hornets"              
## [23] "New Orleans Pelicans"             
## [24] "New Orleans/Oklahoma City Hornets"
## [25] "New York Knicks"                  
## [26] "Oklahoma City Thunder"            
## [27] "Orlando Magic"                    
## [28] "Philadelphia 76ers"               
## [29] "Phoenix Suns"                     
## [30] "Portland Trail Blazers"           
## [31] "Sacramento Kings"                 
## [32] "San Antonio Spurs"                
## [33] "San Diego Clippers"               
## [34] "Seattle SuperSonics"              
## [35] "Toronto Raptors"                  
## [36] "Utah Jazz"                        
## [37] "Vancouver Grizzlies"              
## [38] "Washington Bullets"               
## [39] "Washington Wizards"
## [1] "No"  "Yes"
## [1] "East" "West"
##        X                           Team          year     
##  Min.   :   1.0   Atlanta Hawks      : 37   Min.   :1980  
##  1st Qu.: 254.2   Boston Celtics     : 37   1st Qu.:1990  
##  Median : 507.5   Chicago Bulls      : 37   Median :1999  
##  Mean   : 507.5   Cleveland Cavaliers: 37   Mean   :1999  
##  3rd Qu.: 760.8   Denver Nuggets     : 37   3rd Qu.:2008  
##  Max.   :1014.0   Detroit Pistons    : 37   Max.   :2016  
##                   (Other)            :792                 
##       W.L.        Conference MadePlayoffs      EFG        
##  Min.   :0.1060   East:508   No :441      Min.   :0.4242  
##  1st Qu.:0.3780   West:506   Yes:573      1st Qu.:0.4760  
##  Median :0.5120                           Median :0.4892  
##  Mean   :0.5000                           Mean   :0.4894  
##  3rd Qu.:0.6175                           3rd Qu.:0.5012  
##  Max.   :0.8880                           Max.   :0.5622  
##                                                           
##     Opp_EFG            TTP              DTTP           ORebRate     
##  Min.   :0.4226   Min.   :0.1919   Min.   :0.1864   Min.   :0.1804  
##  1st Qu.:0.4769   1st Qu.:0.2255   1st Qu.:0.2251   1st Qu.:0.2549  
##  Median :0.4900   Median :0.2380   Median :0.2372   Median :0.2813  
##  Mean   :0.4892   Mean   :0.2379   Mean   :0.2379   Mean   :0.2813  
##  3rd Qu.:0.5022   3rd Qu.:0.2500   3rd Qu.:0.2497   3rd Qu.:0.3078  
##  Max.   :0.5398   Max.   :0.3089   Max.   :0.2998   Max.   :0.3767  
##                                                                     
##     DRebRate           FTR              OFTR           EFG_diff         
##  Min.   :0.5790   Min.   :0.1456   Min.   :0.1580   Min.   :-0.0752389  
##  1st Qu.:0.6454   1st Qu.:0.2155   1st Qu.:0.2143   1st Qu.:-0.0176898  
##  Median :0.6658   Median :0.2341   Median :0.2338   Median :-0.0005686  
##  Mean   :0.6670   Mean   :0.2356   Mean   :0.2357   Mean   : 0.0001438  
##  3rd Qu.:0.6896   3rd Qu.:0.2554   3rd Qu.:0.2574   3rd Qu.: 0.0195345  
##  Max.   :0.7503   Max.   :0.3344   Max.   :0.3466   Max.   : 0.0824197  
##                                                                         
##     TTP_diff           RebRate_diff        FTR_diff         
##  Min.   :-7.189e-02   Min.   :-0.5482   Min.   :-0.1375773  
##  1st Qu.:-1.521e-02   1st Qu.:-0.4329   1st Qu.:-0.0235689  
##  Median :-3.601e-04   Median :-0.3848   Median : 0.0009545  
##  Mean   : 5.797e-05   Mean   :-0.3857   Mean   :-0.0001364  
##  3rd Qu.: 1.522e-02   3rd Qu.:-0.3357   3rd Qu.: 0.0244218  
##  Max.   : 6.364e-02   Max.   :-0.2376   Max.   : 0.0967220  
## 

What is the structure of your dataset?

The data set has 1,014 observations of NBA teams across 37 seasons (1980 to 2016). I chose to begin in 1980 because that is the first year there was a 3-point line. Since EFG will become such an important variable (spoiler alert!), I thought it would be better to have 3P field goal data in the study. Each line begins with the team’s name (usually a combination of city and mascot), the year in which the team participated in the NBA, and its winning percentage (Wins / Games Played) for that season.

Each observation has variables associated with advanced statistics of basketball performance. Outside of the categorical fields (making the playoffs and a team’s conference), the variables come in pairs: how well the team performs in an area and how well the team’s opponents fare in the same area. These pairs are:

Effective Field Goal % (EFG) & Opponent Effective Field Goal % (Opp_EFG)

Turnovers / Possession (TTP) & Turnovers Caused / Possession (DTTP)

Offensive Rebound Rate (ORebRate) & Defensive Rebound Rate (DRebRate)

Free Throw Rate (FTR) & Opponent’s Free Throw Rate (OFTR)

Since 8 of my variables form 4 such pairs, I’ve also included the difference for each pair to better display how a team performs against its opponents in that skill (e.g. Effective Field Goal % Difference).
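As a minimal sketch of how the four difference fields relate to the paired columns, the snippet below rebuilds them for the first five rows of the data. The input values are copied (rounded to three decimals) from the str() output above, and the data frame name `nba_head` is my own for illustration. Subtracting in this order reproduces the printed EFG_diff, TTP_diff, RebRate_diff, and FTR_diff values up to rounding.

```r
# First five rows, copied (rounded) from the str() output above.
# `nba_head` is an illustrative name, not an object from the project.
nba_head <- data.frame(
  EFG      = c(0.480, 0.489, 0.496, 0.481, 0.460),
  Opp_EFG  = c(0.498, 0.485, 0.509, 0.497, 0.481),
  TTP      = c(0.266, 0.247, 0.219, 0.222, 0.246),
  DTTP     = c(0.270, 0.246, 0.226, 0.221, 0.200),
  ORebRate = c(0.336, 0.316, 0.353, 0.323, 0.301),
  DRebRate = c(0.642, 0.623, 0.625, 0.654, 0.668),
  FTR      = c(0.293, 0.250, 0.277, 0.203, 0.217),
  OFTR     = c(0.295, 0.249, 0.254, 0.209, 0.196)
)

# Each difference is "first column of the pair minus second column of the pair".
nba_head$EFG_diff     <- nba_head$EFG      - nba_head$Opp_EFG
nba_head$TTP_diff     <- nba_head$TTP      - nba_head$DTTP
nba_head$RebRate_diff <- nba_head$ORebRate - nba_head$DRebRate
nba_head$FTR_diff     <- nba_head$FTR      - nba_head$OFTR

diff_cols <- c("EFG_diff", "TTP_diff", "RebRate_diff", "FTR_diff")
print(round(nba_head[, diff_cols], 3))
```

Note the sign convention this implies: a positive EFG_diff or FTR_diff favors the team, while a positive TTP_diff means the team turns the ball over more often than it forces turnovers.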

My categorical variables include whether a team made the playoffs (the post-season tournament reserved for roughly the top half of teams) and which conference it plays in (East or West).

What is/are the main feature(s) of interest in your dataset?

The main features of the data set are winning % and the 4 team-vs-opponent difference variables (EFG, TTP, RebRate, and FTR), as I want to expand upon Wayne Winston’s work of predicting winning % from the difference variables (he only examined 1 year, using Excel).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think it will also be helpful to look at the individual variables (the non-difference pairs) and the “MadePlayoffs” variable in terms of predicting win percentage. I believe the “MadePlayoffs” variable will be very helpful when graphed to show the direction of each variable’s effect (e.g. a scatter plot with dots colored by whether the team made the playoffs will show which teams are doing better).

Did you create any new variables from existing variables in the dataset?

I created the difference pair variables. This is the method Winston determined gave the best relationship for measuring wins. The remaining variables were created by scraping data from the Basketball Reference website (outlined in a separate file).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

It’s interesting that the mean for W.L. is .500, although it is not a coincidence: every game produces one win and one loss, so winning percentages average out to about .500 across the league. I can also see that we should get roughly “normal”-looking graphs, as most of the means and medians are very close in value. I was able to scrape the data in its current form (it just needed a few clean-up scripts and merging).

Variable Plots

The graphs below offer a quick histogram for each variable. I also broke out each variable with “facet_wrap” to show any difference between playoff teams and non-playoff teams. I thought this would help determine whether the variable has a basketball benefit, since going to the playoffs is seen as a mark of success.
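The grouped summaries printed under each variable (the blocks headed `nba$MadePlayoffs: No` and `nba$MadePlayoffs: Yes`) have the layout that base R's `by()` produces. A minimal sketch of that kind of call is shown here on a small made-up data frame; `nba_toy` and its values are hypothetical and only stand in for the real `nba` data.

```r
# Hypothetical toy data: the values are invented for illustration only.
nba_toy <- data.frame(
  W.L.         = c(0.30, 0.40, 0.45, 0.55, 0.60, 0.70),
  MadePlayoffs = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Six-number summary of winning % within each playoff group.
print(by(nba_toy$W.L., nba_toy$MadePlayoffs, summary))

# Group means alone, as a named vector.
playoff_means <- tapply(nba_toy$W.L., nba_toy$MadePlayoffs, mean)
print(playoff_means)
```

The same pattern applies to any of the numeric columns by swapping `W.L.` for the column of interest.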

W/L%

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1060  0.2930  0.3660  0.3583  0.4390  0.5850 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3660  0.5370  0.6100  0.6091  0.6710  0.8880

The W.L.% has a somewhat normal distribution. It’s also obvious that teams with a higher winning percentage are more likely to be in the playoffs.

EFG (Effective Field Goal Percentage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4242  0.4760  0.4892  0.4894  0.5012  0.5622

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4242  0.4693  0.4792  0.4793  0.4913  0.5451 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4355  0.4853  0.4964  0.4971  0.5086  0.5622

The 2nd graph makes me think better-shooting teams are more likely to be in the playoffs. The range seems fairly small at .42 to .56. I figured some teams would be worse.

Opp_EFG (Opps Effective Field Goal Percentage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4226  0.4769  0.4900  0.4892  0.5022  0.5398

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4453  0.4888  0.4995  0.4991  0.5109  0.5398 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4226  0.4706  0.4823  0.4816  0.4930  0.5278

I think the 2nd graph paints a similar picture to before. This time it leads us to think that better defenses will get in the playoffs. Again, the range is quite small (smaller than EFG) at .42 to .54.

EFG_diff

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0752400 -0.0176900 -0.0005686  0.0001438  0.0195300  0.0824200

## nba$MadePlayoffs: No
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.075240 -0.031800 -0.018650 -0.019820 -0.007291  0.036250 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -3.740e-02  1.251e-05  1.383e-02  1.551e-02  3.063e-02  8.242e-02

It’s great to see the normal distribution, as that should help with the linear model. There does seem to be a big difference between playoff and non-playoff teams if you look at the graph.

TTP (Team Turnovers / Possession)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1919  0.2255  0.2380  0.2379  0.2500  0.3089

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2010  0.2336  0.2451  0.2452  0.2562  0.3089 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1919  0.2208  0.2319  0.2324  0.2432  0.2929

Teams should expect to turn the ball over on roughly 1/5 to 1/3 of their possessions (.19 to .31). The difference between playoff and non-playoff teams doesn’t seem to be as big as it was for EFG.

DTTP (Defensive (Caused) Turnovers / Possession)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1864  0.2251  0.2372  0.2379  0.2497  0.2998

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1902  0.2214  0.2313  0.2323  0.2426  0.2928 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1864  0.2293  0.2424  0.2422  0.2540  0.2998

DTTP ranges from .19 to .30, a slightly narrower range than TTP (.19 to .31), although the two means are identical at .2379. I think the small gap at the top end is explainable, as there could be “unforced” turnovers by the offense.

TTP_diff

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -7.189e-02 -1.521e-02 -3.601e-04  5.797e-05  1.522e-02  6.364e-02

## nba$MadePlayoffs: No
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0417100  0.0003523  0.0128800  0.0128600  0.0247600  0.0636400 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.071890 -0.021630 -0.010150 -0.009796  0.002714  0.044160

Another very “normalish” variable to use for linear model building.

ORebRate (Offensive Rebound Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1804  0.2549  0.2813  0.2813  0.3078  0.3767

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1940  0.2528  0.2768  0.2777  0.3038  0.3767 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1804  0.2572  0.2853  0.2841  0.3120  0.3693

Teams can expect to grab 18% to 38% of their available offensive rebounds. Visually, I’m not sure I can tell a difference between playoff and non-playoff teams for this variable.

DRebRate (Defensive Rebound Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5790  0.6454  0.6658  0.6670  0.6896  0.7503

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5790  0.6435  0.6647  0.6652  0.6879  0.7454 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5889  0.6468  0.6664  0.6685  0.6907  0.7503

Teams appear to grab their defensive rebounds at a much higher rate. The range is about .58 to .75. I have heard of strategies where the offense will not go for rebounds as much but will instead get back early (after the initial shot) to play defense.

RebRate_diff

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5482 -0.4329 -0.3848 -0.3857 -0.3357 -0.2376

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5406 -0.4321 -0.3869 -0.3874 -0.3392 -0.2587 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5482 -0.4334 -0.3826 -0.3843 -0.3333 -0.2376

This variable is much less normal. I almost want to say it has a bi-modal shape to it.

FTR (Free Throws Made / Field Goal Att)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1456  0.2155  0.2341  0.2356  0.2554  0.3344

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1456  0.2093  0.2247  0.2282  0.2457  0.3209 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1612  0.2218  0.2407  0.2413  0.2616  0.3344

Teams should expect their “Free Throw Rate” to be between .15 and .33. The IQR is quite small at only about .04.

OFTR (Opp. Free Throws Made / Opp. Field Goal Att)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1580  0.2143  0.2338  0.2357  0.2574  0.3466

## nba$MadePlayoffs: No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1754  0.2180  0.2394  0.2411  0.2610  0.3466 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1580  0.2105  0.2283  0.2316  0.2519  0.3295

Very similar story with OFTR as compared to FTR.

FTR_diff

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.1376000 -0.0235700  0.0009545 -0.0001364  0.0244200  0.0967200

## nba$MadePlayoffs: No
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.13760 -0.03404 -0.01354 -0.01289  0.01055  0.08248 
## -------------------------------------------------------- 
## nba$MadePlayoffs: Yes
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.085370 -0.012600  0.006060  0.009676  0.032740  0.096720

Another normal looking variable for the linear model.

With the exception of the rebounding graphs, almost all of our data have a fairly normal look. The rebounding data is still somewhat normal, but I think it almost has a bi-modal distribution. This is really great news, as it allows us to continue with the process of building a model where these variables could predict the win % for a given team.

Correlation Analysis

Since I hope to make a Win% Prediction model, I believe my bi-variate analysis and plots should focus on each variable’s relationship with the Win% variable to better understand their importance to the model. I also think I should ensure model viability by making sure the remaining variables are not too correlated with each other.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found that the “difference” variables of EFG, TTP, and FTR definitely have a relationship with winning %. However, the RebRate difference variable did not have much of a relationship. This is quite surprising as I thought this would be one of the biggest factors of winning basketball games.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It’s important to note that the four main difference pairs do not have much correlation with each other. This matters for the assumptions of the linear regression model.
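One way to run that check is to compute a correlation matrix over the four difference variables and flag any pair above a chosen cutoff. The sketch below does this on simulated, independent columns, so it illustrates the mechanics only and says nothing about the real data. The data frame `sim`, the standard deviations, and the 0.5 cutoff are all assumptions of mine.

```r
set.seed(42)
n <- 200

# Simulated stand-ins for the four difference variables (independent draws).
sim <- data.frame(
  EFG_diff     = rnorm(n, mean = 0,     sd = 0.028),
  TTP_diff     = rnorm(n, mean = 0,     sd = 0.022),
  RebRate_diff = rnorm(n, mean = -0.39, sd = 0.070),
  FTR_diff     = rnorm(n, mean = 0,     sd = 0.035)
)

# Pairwise Pearson correlations between the predictors.
cor_mat <- cor(sim)
print(round(cor_mat, 2))

# Flag predictor pairs whose |r| exceeds a hypothetical cutoff of 0.5.
# upper.tri() keeps each pair once and skips the diagonal.
high <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
cat("Pairs above cutoff:", nrow(high), "\n")
```

On the real data the same call would be `cor()` on the four `_diff` columns of `nba`.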

What was the strongest relationship you found?

As far as I can tell, the EFG difference variable vs winning % was the strongest relationship, with a correlation of about .85. This makes a lot of sense. It’s easy to see that teams that shoot well and keep opponents from shooting well should win a lot of basketball games.

Correlation Plots

The first 6 plots below show the strongest relationships with Win%. The narrative makes a lot of sense: if you shoot well, the other teams don’t shoot well, and you don’t turn the ball over, you are likely to win games.

EFG vs W.L.

## [1] 0.6077424

Fairly strong relationship. The better a team can shoot effectively, the more games they should win.

Opp_EFG vs W.L.

## [1] -0.5704137

The variance looks a little larger but there is still a relationship.

EFG_diff vs W.L.

## [1] 0.8452894

This is my strongest relationship in the whole study, for the obvious reasons mentioned earlier. Adding the MadePlayoffs variable as the dot color helps strengthen the visual of the relationship.

TTP vs W.L.

## [1] -0.4608441

TTP_diff vs W.L.

## [1] -0.5925833

Another fairly strong relationship that indicates that it is better to create turnovers than commit them.

FTR_diff vs W.L.

## [1] 0.4066263

My correlation table shows that there is some correlation for these two variables but it is very difficult to see.

The next two plots of this section show the strongest relationships among non-Win% variables. They both make a lot of sense. In the first example, it’s easy to argue that teams that shoot well are simply playing well and are less likely to turn the ball over. The second example is a little more difficult to explain, but I could see a situation where teams that go for offensive rebounds are more likely to get fouled (i.e. big guys “hitting the boards” against the defense).

EFG vs TTP

## [1] -0.3973709

Adding the playoff coloring doesn’t clarify the relationship. In general, it’s just not a strong relationship. I put the 3rd graph in to help clear it up. I was hoping to see large circles on the left and small circles on the right, which would suggest that teams that shoot effectively aren’t as likely to turn the ball over.

FTR vs ORebRate

## [1] 0.3602241

The playoffs factor doesn’t seem to have much effect as there are green and red dots throughout the graph. Again, I created the 3rd graph to see if this would help paint the story of teams that are going for offensive rebounds are more likely to get fouled. I wanted to see small dots on the left and large ones on the right but I’m not sure it’s showing that completely.

The plot below is very difficult to understand, as I don’t know why RebRate_diff isn’t correlated with winning %. I have always understood that rebounding better than your opponent leads to winning the game.

RebRate_diff vs W.L.

## [1] -0.02916389
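One possible explanation can be read off the data itself. The printed values show that RebRate_diff equals ORebRate minus DRebRate (for example .336 − .642 = −.306 in the first row, and the means give .2813 − .6670 = −.3857). That subtracts two of the team's own rates rather than comparing the team with its opponents, which is what the other three difference variables do. The sketch below reproduces the stored values for the first five rows and then computes an alternative. The alternative rests on an assumption that is not confirmed by this write-up: that DRebRate is the share of opponent misses the team rebounds, so the opponent's offensive rebound rate is about 1 − DRebRate.

```r
# First five rows of the two rebounding columns, copied from the str() output.
oreb <- c(0.336, 0.316, 0.353, 0.323, 0.301)
dreb <- c(0.642, 0.623, 0.625, 0.654, 0.668)

# The difference as it appears in the data: own offense minus own defense.
reb_diff_as_stored <- oreb - dreb
print(round(reb_diff_as_stored, 3))

# ASSUMPTION: opponent offensive rebound rate is approximately 1 - DRebRate.
opp_oreb <- 1 - dreb

# Team-vs-opponent version: own offensive rate minus opponent offensive rate.
reb_diff_vs_opp <- oreb - opp_oreb
print(round(reb_diff_vs_opp, 3))
```

Whether the team-vs-opponent version correlates better with winning % would need to be checked against the full data set; this sketch only shows how the two definitions differ.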

I’ve also included a few category variable plots.

W.L. vs MadePlayoffs

W.L. vs Conference

Linear Regression Model

Creating the Model

## 
## Calls:
## m1: lm(formula = W.L. ~ EFG_diff, data = nba)
## m2: lm(formula = W.L. ~ EFG_diff + TTP_diff, data = nba)
## m3: lm(formula = W.L. ~ EFG_diff + TTP_diff + RebRate_diff, data = nba)
## m4: lm(formula = W.L. ~ EFG_diff + TTP_diff + RebRate_diff + FTR_diff, 
##     data = nba)
## 
## ==============================================================
##                      m1         m2         m3         m4      
## --------------------------------------------------------------
##   (Intercept)      0.499***   0.500***   0.541***   0.527***  
##                   (0.003)    (0.002)    (0.013)    (0.011)    
##   EFG_diff         4.890***   4.242***   4.270***   3.880***  
##                   (0.097)    (0.076)    (0.076)    (0.071)    
##   TTP_diff                   -2.711***  -2.696***  -2.833***  
##                              (0.096)    (0.096)    (0.085)    
##   RebRate_diff                           0.107**    0.070*    
##                                         (0.033)    (0.029)    
##   FTR_diff                                          0.861***  
##                                                    (0.051)    
## --------------------------------------------------------------
##   R-squared           0.71       0.84       0.84       0.88   
##   adj. R-squared      0.71       0.84       0.84       0.88   
##   sigma               0.08       0.06       0.06       0.05   
##   F                2532.83    2652.41    1788.32    1788.96   
##   p                   0.00       0.00       0.00       0.00   
##   Log-likelihood   1082.58    1375.90    1381.13    1507.10   
##   Deviance            7.02       3.94       3.89       3.04   
##   AIC             -2159.15   -2743.81   -2752.26   -3002.19   
##   BIC             -2144.39   -2724.12   -2727.65   -2972.66   
##   N                1014       1014       1014       1014      
## ==============================================================
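The m4 column can be read as a prediction formula. As a sketch, the snippet below plugs the rounded m4 coefficients from the table into a small function and applies it to the first row of the data (the 1981 Atlanta Hawks), using the difference values printed in the str() output. The names `m4_coef` and `predict_wl` are mine, for illustration, and because the coefficients are rounded the result will differ slightly from what the fitted model object would return.

```r
# Rounded m4 coefficients, copied from the model table above.
m4_coef <- c(
  intercept    =  0.527,
  EFG_diff     =  3.880,
  TTP_diff     = -2.833,
  RebRate_diff =  0.070,
  FTR_diff     =  0.861
)

# Illustrative helper: linear combination of the four difference variables.
predict_wl <- function(efg_diff, ttp_diff, rebrate_diff, ftr_diff) {
  m4_coef[["intercept"]] +
    m4_coef[["EFG_diff"]]     * efg_diff +
    m4_coef[["TTP_diff"]]     * ttp_diff +
    m4_coef[["RebRate_diff"]] * rebrate_diff +
    m4_coef[["FTR_diff"]]     * ftr_diff
}

# Row 1 of the data: Atlanta Hawks, 1981 (actual W.L. = 0.378).
hawks_1981 <- predict_wl(-0.01755, -0.003762, -0.306, -0.0017)
print(round(hawks_1981, 3))  # about 0.447

# A team with zero differences and the mean RebRate_diff (-0.3857).
average_team <- predict_wl(0, 0, -0.3857, 0)
print(round(average_team, 3))  # about 0.500
```

The second call shows why the m4 intercept is .527 rather than .500: RebRate_diff is centered near −.39 instead of 0, so the intercept absorbs the offset and an average team still lands at .500.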

Attach Predictions to DF

## [1] 0.04402368

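It looks like the model predicts winning % with an average absolute error of only about .044, i.e. 4.4 percentage points! Over a standard 82-game season that is roughly 3.6 wins. Not too bad.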
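A minimal sketch of this step, fitting a model, attaching its predictions to the data frame, and averaging the absolute errors, is shown below. It runs on simulated data with only two predictors, so the numbers it prints are not the project's results. The data frame `sim`, the generating coefficients, and the noise level are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 300

# Simulated predictors and response (made-up generating process).
sim <- data.frame(
  EFG_diff = rnorm(n, mean = 0, sd = 0.028),
  TTP_diff = rnorm(n, mean = 0, sd = 0.022)
)
sim$W.L. <- 0.5 + 4.2 * sim$EFG_diff - 2.7 * sim$TTP_diff +
  rnorm(n, mean = 0, sd = 0.05)

# Fit the linear model and attach in-sample predictions to the data frame.
fit <- lm(W.L. ~ EFG_diff + TTP_diff, data = sim)
sim$predicted <- predict(fit)

# Mean absolute error between actual and predicted winning %.
mae <- mean(abs(sim$W.L. - sim$predicted))
print(mae)
```

Because the error is measured on the same rows used to fit the model, it is an in-sample figure; holding out some seasons would give a stricter test.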

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. The model accounts for 88% of the variance in teams’ winning percentage each year, using the 4 paired difference variables as inputs. This model allows one to predict winning % fairly accurately from those four variables. The most important factor, by far, is EFG_diff: using that variable alone I am able to account for 71% of the variance (consistent with its correlation of .845 with W.L., since .845² ≈ .71).


Final Plots and Summary

Plot One - Histogram of Winning %, Broken out by Playoff & Non-Playoff Teams

Description One

I believe this chart makes the most obvious point: if your team wins basketball games, it is more likely to make the playoffs. The colors really complete the picture, as they are clearly separated between left and right. However, I will note there do appear to be instances of losing teams (teams with a Win% below 50%) making the playoffs and winning teams (teams with a Win% above 50%) not making the playoffs.

Plot Two - EFG Difference vs Winning %

Description Two

This chart displays the strongest variable in the linear regression model. On its own, the EFG Difference variable explains about 71% of the variance in winning %. I’ve also added some playoff coloring to help show the effectiveness of the variable.

Plot Three

Description Three

This graph is not critical to exploring the main goal of the project (Predicting Win%). However, I did want to give any noticeable relationship (In this case, about .4) an investigation. I was trying to better understand why this relationship is present. My first thought is that teams that shoot well are probably just playing good basketball and not turning the ball over as much. If that is truly the case, then we should see large dots in the lower left and smaller dots in the upper right. I think this happens some but it is nowhere near air tight.


Reflection

This study was definitely a learning experience. It began with reading Winston’s chapters on quantifying wins in the NBA, became slightly difficult when scraping the data from Basketball Reference, and then became very tedious while building graph after graph.

In Winston’s chapter on basketball, he only looked at one year’s worth of basketball statistics to calculate wins. I knew I could expand that study to as many years as had relevant data, but I would need to change from wins to winning % since not every season has the same number of games played. From there I really had to dive into the “rvest” package to learn more about web scraping in R (I’ve already completed the Data Wrangling Course and I think it helped a lot). I found that the data was mostly clean except for slight variations that occur over the years at the Basketball Reference site (e.g. in some years 3P% is labeled as 3P, while in others it is labeled correctly).

Once I had all my data, I originally only had the mindset of replicating the Winston linear model. I wanted to test the winning percentage model and call it a day. However, I decided to go along with the project outline, and I found myself looking at the data in other ways that led to other questions I could be asking (e.g. why do teams have fewer turnovers when they shoot well?). By the time I reached the conclusion, I didn’t even care about the original model because I wanted to investigate the other questions. I feel the project definitely opened my eyes to how much we can miss if we don’t give exploratory data analysis the attention and detail it deserves.